Add local caching of `Function`s, by reference, in the `PDFFunctionFactory` (issue 2541) #12034
Conversation
From: Bot.io (Linux m4) [Received]
Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.67.70.0:8877/3887112ee88ead3/output.txt

From: Bot.io (Windows) [Received]
Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.215.176.217:8877/694bc80c7d35ba6/output.txt

From: Bot.io (Linux m4) [Failed]
Full output at http://54.67.70.0:8877/3887112ee88ead3/output.txt
Total script time: 25.84 mins
Image differences available at: http://54.67.70.0:8877/3887112ee88ead3/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows) [Failed]
Full output at http://54.215.176.217:8877/694bc80c7d35ba6/output.txt
Total script time: 29.01 mins
Image differences available at: http://54.215.176.217:8877/694bc80c7d35ba6/reftest-analyzer.html#web=eq.log
347facc to 57a292a
From: Bot.io (Windows) [Received]
Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.215.176.217:8877/f05c82d485dd480/output.txt

From: Bot.io (Windows) [Failed]
Full output at http://54.215.176.217:8877/f05c82d485dd480/output.txt
Total script time: 29.34 mins
Image differences available at: http://54.215.176.217:8877/f05c82d485dd480/reftest-analyzer.html#web=eq.log
57a292a to 9fa5b6f
Add local caching of `Function`s, by reference, in the `PDFFunctionFactory` (issue 2541)

Note that, compared to other structures such as e.g. Images and ColorSpaces, `Function`s are not referred to by name, which however brings the advantage of being able to share the cache for an *entire* page. Furthermore, similar to ColorSpaces, the parsing of individual `Function`s is generally fast enough to not really warrant trying to cache them in any "smarter" way than by reference. (Hence trying to do caching similar to e.g. Fonts would most likely be a losing proposition, given the amount of data lookup/parsing that'd be required.)

Originally I tried implementing this similar to e.g. the recently added ColorSpace caching (and in a couple of different ways), however it unfortunately turned out to be quite ugly/unwieldy given the sheer number of functions/methods where you'd thus need to pass in a `LocalFunctionCache` instance. (Also, the affected functions/methods didn't exactly have short signatures as-is.)

After going back and forth on this for a while, it seemed to me that the simplest, or least "invasive" if you will, solution would be for each `PartialEvaluator` instance to have its *own* `PDFFunctionFactory` instance (since the latter is already passed to all of the required code). This way each `PDFFunctionFactory` instance can have a local `Function` cache, without it being necessary to provide a `LocalFunctionCache` instance manually at every `PDFFunctionFactory.{create, createFromArray}` call-site.

Obviously, with this patch, there are now (potentially) more `PDFFunctionFactory` instances than before, when the entire document shared just one. However, each such instance is really quite small, and it's also tied to a `PartialEvaluator` instance, which is *not* kept alive and/or cached. To reduce the impact of these changes, I've tried to make as many of these structures as possible *lazily initialized*, specifically:

- The `PDFFunctionFactory`, on `PartialEvaluator` instances, since not all kinds of general parsing actually require it. For example: `getTextContent` calls won't cause any `Function` to be parsed, and even some `getOperatorList` calls won't trigger `Function` parsing (if a page contains e.g. no Patterns or "complex" ColorSpaces).
- The `LocalFunctionCache`, on `PDFFunctionFactory` instances, since only certain parsing requires it. Generally speaking, only e.g. Patterns, "complex" ColorSpaces, and/or (some) SoftMasks will trigger any `Function` parsing.

To put these changes into perspective: when loading/rendering all (14) pages of the default `tracemonkey.pdf` file, there's now a total of 6 `PDFFunctionFactory` and 1 `LocalFunctionCache` instances created, thanks to the lazy initialization. (If you instead kept the document-"global" `PDFFunctionFactory` instance and passed around `LocalFunctionCache` instances everywhere, the numbers for the `tracemonkey.pdf` file would instead be something like 1 `PDFFunctionFactory` and 6 `LocalFunctionCache` instances.) All in all, I thus don't think that the `PDFFunctionFactory` changes should be generally problematic.

With these changes, we can also modify (some) call-sites to pass in a `Reference` rather than the actual `Function` data. This is nice since `Function`s can also be `Stream`s, which are not cached on the `XRef` instance (given their potential size), and this way we can avoid unnecessary lookups and thus save some additional time/resources.
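For illustration, here is a minimal, self-contained sketch of what reference-keyed `Function` caching with a lazily created cache could look like. The `Ref`, `LocalFunctionCache`, and `PDFFunctionFactory` names come from the description above, but the exact class shapes, method signatures, and the stubbed-out parsing are assumptions for this sketch rather than the actual PDF.js implementation:

```js
// Sketch only: reference-keyed caching in a factory, with lazy cache creation.
class Ref {
  constructor(num, gen) {
    this.num = num;
    this.gen = gen;
  }
  toString() {
    return this.gen === 0 ? `${this.num}R` : `${this.num}R${this.gen}`;
  }
}

class LocalFunctionCache {
  constructor() {
    this._map = new Map(); // keyed by the Reference's string form, e.g. "12R"
  }
  getByRef(ref) {
    return this._map.get(ref.toString());
  }
  set(ref, fn) {
    this._map.set(ref.toString(), fn);
  }
}

class PDFFunctionFactory {
  constructor({ xref }) {
    this.xref = xref;
    this._cache = null; // lazily initialized, as described above
  }
  create(fnRef) {
    if (this._cache) {
      const cached = this._cache.getByRef(fnRef);
      if (cached) {
        return cached; // cache hit: no XRef lookup or re-parsing needed
      }
    }
    // Stand-in for the real lookup and parsing of the Function data.
    const parsedFn = { data: this.xref.fetch(fnRef) };
    if (!this._cache) {
      this._cache = new LocalFunctionCache();
    }
    this._cache.set(fnRef, parsedFn);
    return parsedFn;
  }
}

// Usage: two `create` calls with the same Reference only parse once.
const xref = { fetch: ref => ({ ref: ref.toString() }) };
const factory = new PDFFunctionFactory({ xref });
const ref = new Ref(12, 0);
console.log(factory.create(ref) === factory.create(ref)); // true
```

Keying the cache by the `Reference`, rather than by the resolved data, is what allows a cache hit to skip fetching `Stream`-based `Function`s from the `XRef` entirely.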
Obviously I had intended to include (standard) benchmark results with these changes, but for reasons I don't really understand, the test run-time (even with `master`) of the document in issue 2541 is quite a bit slower than in the development viewer. However, logging the time it takes for the relevant `PDFFunctionFactory`/`PDFFunction` parsing shows that it takes *approximately* `0.5 ms` for the `Function` in question. Looking up a cached `Function`, on the other hand, is *one order of magnitude faster*, which does add up when the same `Function` is invoked close to 2000 times.
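As a rough illustration of how such a measurement could be logged (not the actual instrumentation used for the numbers above), one could wrap the factory call with `performance.now()`; `factory` and `fnRef` here are assumed to exist as in the sketch above:

```js
// Hypothetical ad-hoc timing around Function creation.
const start = performance.now();
const fn = factory.create(fnRef);
const elapsed = performance.now() - start;
console.log(`Function creation took ${elapsed.toFixed(3)} ms`);
```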
9fa5b6f to ca719ec
/botio test

From: Bot.io (Windows) [Received]
Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.215.176.217:8877/6d1f9254395cb43/output.txt

From: Bot.io (Linux m4) [Received]
Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.67.70.0:8877/97dd93abf9c78ef/output.txt

From: Bot.io (Linux m4) [Failed]
Full output at http://54.67.70.0:8877/97dd93abf9c78ef/output.txt
Total script time: 26.21 mins
Image differences available at: http://54.67.70.0:8877/97dd93abf9c78ef/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows) [Failed]
Full output at http://54.215.176.217:8877/6d1f9254395cb43/output.txt
Total script time: 29.07 mins
Image differences available at: http://54.215.176.217:8877/6d1f9254395cb43/reftest-analyzer.html#web=eq.log
/botio-linux preview

From: Bot.io (Linux m4) [Received]
Command cmd_preview from @timvandermeij received. Current queue size: 0
Live output at: http://54.67.70.0:8877/82b629c7a831a16/output.txt

From: Bot.io (Linux m4) [Success]
Full output at http://54.67.70.0:8877/82b629c7a831a16/output.txt
Total script time: 3.33 mins
Published
Thank you for improving this!